Improving the I/O Throughput for Data-Intensive Scientific Applications with Efficient Compression Mechanisms

Authors

  • Dongfang Zhao
  • Jian Yin
  • Ioan Raicu
Abstract

Today’s science is generating significantly larger volumes of data than before. Data compression can potentially improve application performance. However, in many scientific applications, and especially in large-scale parallel scientific applications, each process often accesses only part of the data. As a result, a process may decompress data that it never uses. General compression libraries (e.g., LZO [1], bzip2 [2], and zlib [3]) do not consider chunk size in the context of parallel and distributed file systems. Indexing is a widely used technique for online scientific encoding and query (e.g., [4–6]), even though it does not yield a high compression ratio for large chunks. Some mechanisms [7–9] are provided as user-level libraries, which means application developers need to modify the application and/or the high-level library. We propose two techniques that leverage compression to improve the performance of large-scale parallel applications. First, we enable decompression to start from the middle of a chunk and to stop once the needed data has been extracted, which eliminates the overhead of decompressing data the process does not need. Second, we build compression into the parallel file system, which allows caching and prefetching to be seamlessly integrated and lets applications leverage compression transparently. Caching decompressed data allows the data to be accessed again at later points without repeating the decompression.
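The two ideas in the abstract (decompressing only the part of a chunk that a process needs, and caching decompressed data for later accesses) can be illustrated with a minimal sketch. This is not the authors' mechanism: it approximates mid-chunk access by compressing fixed-size blocks independently with stock zlib and keeping an offset index, and it runs in user space rather than inside a parallel file system. The names `BLOCK_SIZE`, `compress_blocks`, and `BlockReader` are hypothetical and chosen only for illustration.

```python
import zlib
from functools import lru_cache

BLOCK_SIZE = 4096  # hypothetical uncompressed block size, in bytes


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress each fixed-size block independently and record where it starts."""
    blob = bytearray()
    offsets = [0]  # offsets[i] is the start of compressed block i inside blob
    for start in range(0, len(data), block_size):
        blob += zlib.compress(data[start:start + block_size])
        offsets.append(len(blob))
    return bytes(blob), offsets


class BlockReader:
    """Serve byte ranges by decompressing only the blocks that overlap them."""

    def __init__(self, blob, offsets, block_size=BLOCK_SIZE, cache_blocks=8):
        self.blob = blob
        self.offsets = offsets
        self.block_size = block_size
        self.decompressed_blocks = 0  # number of real decompressions (cache misses)
        # Keep recently decompressed blocks so repeated reads skip zlib entirely.
        self._block = lru_cache(maxsize=cache_blocks)(self._load_block)

    def _load_block(self, i):
        self.decompressed_blocks += 1
        return zlib.decompress(self.blob[self.offsets[i]:self.offsets[i + 1]])

    def read(self, offset, length):
        """Return data[offset:offset + length] of the original, uncompressed bytes."""
        if length <= 0:
            return b""
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        last = min(last, len(self.offsets) - 2)  # clamp to the final block
        parts = [self._block(i) for i in range(first, last + 1)]
        skip = offset - first * self.block_size
        return b"".join(parts)[skip:skip + length]


if __name__ == "__main__":
    data = bytes(range(256)) * 64  # 16384 bytes, i.e. four blocks of 4096
    blob, offsets = compress_blocks(data)
    reader = BlockReader(blob, offsets)
    piece = reader.read(5000, 100)  # touches a single block
    print(piece == data[5000:5100], reader.decompressed_blocks)
```

The trade-off mirrors the one the abstract raises about chunk size: smaller blocks waste less decompression work per read but usually compress less well, since each block is compressed without context from its neighbors.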

Similar resources

High Throughput Data-Compression for Cloud Storage

As data volumes processed by large-scale distributed data-intensive applications grow at high speed, increasing I/O pressure is put on the underlying storage service, which is responsible for data management. One particularly difficult challenge that the storage service has to deal with is to sustain a high I/O throughput in spite of heavy access concurrency to massive data. In order to do ...

Requirements of I/O Systems for Parallel Machines: An Application-driven Study

I/O-intensive parallel programs have emerged as one of the leading consumers of cycles on parallel machines. This change has been driven by two trends. First, parallel scientific applications are being used to process larger datasets that do not fit in memory. Second, a large number of parallel machines are being used for non-scientific applications. Efficient execution of these applications re...

Architectural Support for User-Level Input/Output

The performance of the input/output subsystem is becoming increasingly important for many applications. Commercial I/O intensive applications are a fast growing market segment and experience constantly increasing performance demands. Many of these applications exploit concurrency to overlap the latency of I/O operations to improve throughput. At the same time, semiconductor technology trends re...

High-Performance Storage Support for Scientific Big Data Applications on the Cloud

This work studies the storage subsystem for scientific big data applications to be running on the cloud. Although cloud computing has become one of the most popular paradigms for executing data-intensive applications, the storage subsystem has not been optimized for scientific applications. In particular, many scientific applications were originally developed assuming a tightly-coupled cluster ...

Runtime I/O Re-Routing + Throttling on HPC Storage

Massively parallel storage systems are becoming more and more prevalent on HPC systems due to the emergence of a new generation of data-intensive applications. To achieve the level of I/O throughput and capacity that is demanded by data intensive applications, storage systems typically deploy a large number of storage devices (also known as LUNs or data stores). In doing so, parallel applicatio...

Publication date: 2013